MEDB 5510, Module 11, Data management

Topics to be covered

  • What you will learn
    • Data dictionary
    • Missing values
    • Dates
    • Categorical data
    • Longitudinal and repeated measures data
    • Excel files
    • Text files
    • Databases and double entry

Data management

  • Data dictionary
    • Variable names, Variable labels, Value labels, Missing value codes
  • Managing complex files
    • Multiple response, Longitudinal/repeated measures data
  • Storage options
    • Spreadsheet, Text file, Database, REDCap

Data dictionary

  • Also called a code book
  • Start before collecting data
  • Revise as needed

Contents of a data dictionary

  • Variable names
  • Variable labels
  • Units of measurement
  • Permissible/impermissible values
  • Value labels
  • Missing value codes
  • Source
  • License

Variable names

  • Brief, but descriptive explanation
  • Roughly 4 to 16 characters
  • No blanks and (almost) no symbols
  • One to three words

Good and bad variable names

  • Names to avoid (www.writersexchange.com)
    • systolic blood pressure, systolic-blood-pressure
  • Names that work
    • systolic_blood_pressure, systolic.blood.pressure, SystolicBloodPressure
  • NEVER USE ALL CAPS FOR VARIABLE NAMES
    • Lower case has ascenders (e.g., f, l) and descenders (e.g., g, y)

Variable labels

  • Longer descriptions
    • Can include spaces and punctuation
    • Ideal length is 20-40 characters
    • Mention units of measurement, special qualifiers

Break #1

  • What you have learned
    • Data dictionary
  • What’s coming next
    • Missing values

Missing value codes

  • Explain WHY the value is missing
  • For a survey
    • Did not answer
    • Not applicable
  • For a lab result
    • Below the limit of detection
    • Insufficient volume for testing
    • Dropped the test tube and it shattered making a huge mess

Example of missing value codes

  • Use extreme number code
    • 9, 99, 999
    • -1
  • Use symbols
    • NA
    • (asterisk)
    • (dot)
  • Never use blanks to designate missing
  • Note missing value code on data dictionary

Missing value example

Break #2

  • What you have learned
    • Missing values
  • What’s coming next
    • Dates

Date formats

Internal storage formats

  • Excel - number of days since 1899-12-31
    • Actually January 0, 1900 (1900-01-00)!
  • R - number of days since January 1, 1970
  • SAS - number of days since January 1, 1960
  • SPSS - number of seconds since October 14, 1582

Gregorian calendar

Gregorian calendar

Painting of Pope Gergory XIII

Break #3

  • What you have learned
    • Dates
  • What’s coming next
    • Categorical data

Categorical values

  • Definition: small number of possible values
  • Beware of ambiguities
    • YES, yes, and Yes are three distinct levels.
  • Use number codes
    • 0, 1, 9 for binary variables
  • Single letter codes
    • M, F, and U for gender
    • Potentially ambiguous
    • Consistent case is important.

Example of ambiguous coding

Reverse coding (1 of 2)

  • Context specific
  • Sequence of IF THEN ELSE statements
if (is.na(x)) then y=NA
  else if (x=1) then y=4
  else if (x=2) then y=3
  else if (x=3) then y=2
  else if (x=4) then y=1
  else y=9

Reverse coding (2 of 2)

  • Functional transformations
0,1 to 1,0

 x    y=1-x

 0      1
 1      0
 
 1,2,3,4 to 4,3,2,1   but  0,1,2,3,4 to 4,3,2,1,0
 
  x          y=5-x          x            y=4-x
  1            4            0              4
  2            3            1              3
  3            2            2              2
  4            1            3              1
                            4              0
  • Always check your results
  • Watch out for missing value codes

A multiple response example

Coding with a single string

Coding using single letters

Break #4

  • What you have learned
    • Categorical data
  • What’s coming next
    • Longitudinal and repeated measures data

Longitudinal data, Repeated measures data

  • Tall and thin format
    • One line per visit/measurement
  • Short and fat format
    • One line per patient

Example of tall/thin, dictionary

Example of short/fat, dictionary

Repeated measures example in short and fat format

Example of tall/thin, data

Example of short/fat, data

A simple alternative to both tall/thin and short/fat

  • Disadvantages of tall/thin
    • Too much repetition
  • Disadvantages of short/fat
    • Hard to read
  • Database format
    • Time constant table
    • Time varying table

Time constant data

Time varying data

Break #5

  • What you have learned
    • Longitudinal and repeated measures data
  • What’s coming next
    • Excel files

Spreadsheet software

  • Examples
    • Microsoft Excel
    • Macintosh Numbers
    • Libre Office Calc
    • Google Sheets
  • Not recommended

If you enter data into Excel

  • Do not use colors
  • Do not include summary statistics
  • Rectangular grid
  • Don’t squeeze two data values into one cell
    • Systolic/diastolic blood pressures
    • 44M for a 44 year old male
  • Variable names in first row
  • No blank cells
    • Contradicts your book

A poorly structured spreadsheet

Revisions to this spreadsheet

The codebook from this spreadsheet

Break #6

  • What you have learned
    • Excel files
  • What’s coming next
    • Text files

Text files

  • Fixed width
  • Delimited
    • Commas
    • Spaces
    • Tabs
    • “Quotes around text”

Data dictionary for aboriginal prison death study

Comma separated values (csv)

Comma separated values with quotes

Fixed width format

Spaced format

Data set using a spaced format

Tab separated values

Break #7

  • What you have learned
    • Text files
  • What’s coming next
    • Databases and double entry

Database systems

  • Terminology
    • Tables
    • Fields
    • Records
    • Primary key
    • Foreign key

Database table for value labels

Database linkage between race code and race labels

REDCap

  • Research Electronic Data Capture
  • Not open source, but freely distributed by Vanderbilt
  • Software components
    • PHP
    • JavaScript
    • MySQL
  • Case report forms
  • Strongly recommended

Double entry coding

  • Great quality check
    • If you can afford it
  • Prepare a code book first
    • Count the proportion of discrepancies
  • If too many discrepancies
    • Revise the code book and re-do the data entry.
  • If discrepancies small enough
    • Report this number in your publication

Summary

  • What you have learned
    • Data dictionary
    • Missing values
    • Dates
    • Categorical data
    • Longitudinal and repeated measures data
    • Excel files
    • Text files
    • Databases and double entry